Skip to content

mongodb: various test fixes#4303

Open
mmatczuk wants to merge 9 commits intomainfrom
mmt/mongodb-fixes-v2
Open

mongodb: various test fixes#4303
mmatczuk wants to merge 9 commits intomainfrom
mmt/mongodb-fixes-v2

Conversation

@mmatczuk
Copy link
Copy Markdown
Contributor

Commits

  • fix data race on ChangeStream between TryNext and ResumeToken
  • fix ResumeWithSnapshot test expecting error on clean shutdown
  • tolerate context.Canceled in RunAsync helper
  • reduce sleep times in ResumeAfterSnapshotWithoutChanges test
  • handle null fullDocument in update_lookup mode
  • increase cdc package integration test timeout to 15m
  • check shutdown signal in readChan sends to prevent hang
  • reduce parallel snapshot test data from 1M to 100k documents
  • retry Ping to tolerate replica set startup lag

Jira

  • CON-418
  • CON-419
  • CON-420
  • CON-428
  • CON-430
  • CON-452

The ack callback called stream.ResumeToken() from a different goroutine
than the one running stream.TryNext(), racing on the ChangeStream's
internal resume token field. Use the value already resolved by the
checkpoint system instead, which is both race-free and semantically
correct.

Fixes CON-418
stream.Run() returns nil on forced shutdown (benthos logs a warning
instead of returning an error). Changed RunAsyncWithErrors to RunAsync
and removed the unused helper method.

Fixes CON-419
StopNow calls StopWithin(0), which returns immediately without waiting
for the stream to stop. The stream goroutine later exits with
context.Canceled when test cleanup cancels t.Context(). The previous
fix (744e120d0) assumed stream.Run() returns nil on forced shutdown,
but it returns context.Canceled. Accept that as a normal shutdown error.

Fixes CON-419
Replace the first 5s fixed sleep with require.Eventually polling for
snapshot output, and reduce the second 5s sleep to 2s. The test was
timing out under the global 5-minute package budget because other slow
tests (1M-doc parallel snapshot, 35s connection timeout test) consumed
the allotted time first.

Fixes CON-420
In update_lookup mode, MongoDB sets fullDocument to null when the
document is deleted before the post-update lookup completes. The code
checked only key presence, not nil value, so a null fullDocument passed
through as a nil message. Now falls back to documentKey when
fullDocument is null for update events, matching the delete-event
fallback pattern.

Also adds a small sleep in the integration test between UpdateOne and
DeleteByID to prevent the race from making the test flaky.

Fixes CON-428
The default 5-minute timeout expired before all 58 tests in the
mongodb/cdc package could complete, causing a package-level panic.

Fixes CON-420
When the benthos read_until idle timeout fires, the framework stops
draining readChan before mongodb_cdc.Close() is called. During this
window, snapshot goroutines block on the readChan send with no reader
and no context cancellation. Add m.shutsig.SoftStopChan() to the
select statements in readSnapshotRange and readFromStream so goroutines
can exit immediately when shutdown begins.

Fixes CON-430
TestIntegrationMongoCDCWithParallelSnapshot was writing 1M documents and
taking ~136s, consuming most of the 5-minute go test timeout. Later tests
like TestIntegrationMongoCDCSchemaValidator had no time budget left and
were killed by the global timeout.

100k documents still exercises 8-worker parallel snapshot adequately
while reducing the test from ~136s to ~18s.

Fixes CON-420
The replica set can take a moment after container readiness before it
accepts client connections through the mapped port, causing intermittent
i/o timeouts. Wrap the initial Ping in EventuallyWithT so transient
failures during replica set initialization don't abort the test.

Fixes CON-452
@claude
Copy link
Copy Markdown

claude Bot commented Apr 20, 2026

Commits
LGTM

Review
Targeted test-stability PR for the mongodb_cdc package: fixes a data race on ChangeStream resume tokens, adds a shutdown-signal escape to the readChan sends to unblock goroutines during forced shutdown, handles null fullDocument in update_lookup mode, and tightens the integration-test harness (timeout bump, sleep reductions, Ping retry, reduced snapshot test data size).

LGTM

This was referenced Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants